Object-based audio-visual terminal and bitstream structure

ABSTRACT

As information to be processed at an object-based video or audio-visual (AV) terminal, an object-oriented bitstream includes objects, composition information, and scene demarcation information. Such bitstream structure allows on-line editing, e.g. cut and paste, insertion/deletion, grouping, and special effects. In the interest of ease of editing, AV objects and their composition information are transmitted or accessed on separate logical channels (LCs). Objects which have a lifetime in the decoder beyond their initial presentation time are cached for reuse until a selected expiration time. The system includes a de-multiplexer, a controller which controls the operation of the AV terminal, input buffers, AV object decoders, buffers for decoded data, a composer, a display, and an object cache.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional of U.S. patent application Ser. No. 09/367,433, filed Jan. 13, 2000, now U.S. Pat. No. 7,199,836, which is a national stage of International Application PCT/US98/02668, filed Feb. 13, 1998, which claims the benefit of U.S. Provisional Application Ser. No. 60/037,779, filed Feb. 14, 1997, each of which is incorporated by reference in its entirety herein, and from which priority is claimed.

TECHNICAL FIELD

This invention relates to the representation, transmission, processing and display of video and audio-visual information, more particularly of object-based information.

BACKGROUND OF THE INVENTION

Image and video compression techniques have been developed which, unlike traditional waveform coding, attempt to capture high-level structure of visual content. Such structure is described in terms of constituent “objects” which have immediate visual relevancy, representing familiar physical objects, e.g. a ball, a table, a person, a tune or a spoken phrase. Objects are independently encoded using a compression technique that gives best quality for each object. The compressed objects are sent to a terminal along with composition information which tells the terminal where to position the objects in a scene. The terminal decodes the objects and positions them in the scene as specified by the composition information. In addition to yielding coding gains, object-based representations are beneficial with respect to modularity, reuse of content, ease of manipulation, ease of interaction with individual image components, and integration of natural, camera-captured content with synthetic, computer-generated content.

SUMMARY OF THE INVENTION

In a preferred architecture, structure or format for information to be processed at an object-based video or audio-visual (AV) terminal, an object-oriented bitstream includes objects, composition information, and scene demarcation information. The bitstream structure allows on-line editing, e.g. cut and paste, insertion/deletion, grouping, and special effects.

In the preferred architecture, in the interest of ease of editing, AV objects and their composition information are transmitted or accessed on separate logical channels (LCs). The architecture also makes use of “object persistence”, taking advantage of some objects having a lifetime in the decoder beyond their initial presentation time, until a selected expiration time.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a functional schematic of an exemplary object-based audio-visual terminal.

FIG. 2a is a schematic of an exemplary object-based audio-visual composition packet.

FIG. 2b is a schematic of an exemplary object-based audio-visual data packet.

FIG. 2c is a schematic of an exemplary compound composition packet.

FIG. 3 is a schematic of exemplary node and scene description information using composition.

FIG. 4 is a schematic of exemplary stream-node association information.

FIG. 5 is a schematic of exemplary node/graph update information using a scene.

FIG. 6 is a schematic of an exemplary audio-visual terminal design.

FIG. 7 is a schematic of an exemplary audio-visual system controller in the terminal according to FIG. 6.

FIG. 8 is a schematic of exemplary information flow in the controller according to FIG. 7.

DETAILED DESCRIPTION

An audio-visual (AV) terminal is a systems component which is instrumental in forming, presenting or displaying audio-visual content. This includes (but is not limited to) end-user terminals with a monitor screen and loudspeakers, as well as server and mainframe computer facilities in which audio-visual information is processed. In an AV terminal, desired functionality can be hardware-, firmware- or software-implemented. Information to be processed may be furnished to the terminal from a remote information source via a telecommunications channel, or it may be retrieved from a local archive, for example. An object-oriented audio-visual terminal more specifically receives information in the form of individual objects, to be combined into scenes according to composition information supplied to the terminal.

FIG. 1 illustrates such a terminal, including a de-multiplexer (DMUX) 1 connected via a logical channel LC0 to a system controller or “executive” 2 and via logical channels LC1 through LCn to a buffer 3. The executive 2 and the buffer 3 are connected to decoders 4 which in turn are connected to a composer unit 5. Also, the executive 2 is connected to the composer unit 5 directly, and has an external input for user interaction, for example.

In the preferred AV architecture, the AV objects and their composition information are transmitted or accessed on separate logical channels. The DMUX receives the Mux2 layer from the lower layers and de-multiplexes it into logical channels. LC0 carries composition information which is passed on to the executive. The AV objects received on other logical channels are stored in the buffer to be acted upon by the decoders. The executive receives the composition information, which includes the decoding and presentation time stamps, and instructs the decoders and composer accordingly.
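To make this routing concrete, the following C sketch forwards LC0 packets to the executive and stores all other channels in the input buffer. It is illustrative only; the names dmux_route, send_to_executive and buffer_store are hypothetical, not part of any specification.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical handlers standing in for the executive and the buffer. */
    static void send_to_executive(const uint8_t *pkt, size_t len)
    {
        (void)pkt;
        printf("LC0: %zu bytes of composition information\n", len);
    }

    static void buffer_store(unsigned lc, const uint8_t *pkt, size_t len)
    {
        (void)pkt;
        printf("LC%u: %zu bytes of object data buffered\n", lc, len);
    }

    /* Route one de-multiplexed packet by its logical channel number. */
    void dmux_route(unsigned lc, const uint8_t *pkt, size_t len)
    {
        if (lc == 0)
            send_to_executive(pkt, len);   /* composition -> executive     */
        else
            buffer_store(lc, pkt, len);    /* object data -> input buffers */
    }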

The system handles object composition packets (OCP) and object data packets (ODP). A composition packet contains an object's ID, time stamps and the “composition parameters” for rendering the object. An object data packet contains an object ID, an expiration time stamp in case of persistent objects, and object data.

Preferably, any external input such as user interaction is converted to OCP and/or ODP before it is presented to the executive. There is no need for headers in a bitstream delivered over a network. However, headers are required when storing an MPEG-4 presentation in a file.

FIGS. 2a and 2b illustrate the structure of composition and data packets in further detail. Relevant features are as follows:

Object ID is composed of object type and object number. The default length of the Object ID is 2 bytes, including 10 bits for the object number and 6 bits for the object type (e.g. text, graphics, MPEG2 VOP, compound object). An extensible code is used to accommodate more than 1023 objects or more than 31 object types. The following convention will be adhered to: a value of 0b111111 in the first six bits of the Object ID corresponds to 31 plus the value of the byte immediately following the Object ID; a value of 0b11.1111.1111 in the least significant 10 bits of the Object ID corresponds to 1023 plus the value of the two bytes immediately following the Object ID (without counting the object type extension bytes, if present). The following object types are defined:

Composition Objects (16-bit object IDs)

    0x0000  scene configuration object
    0x0001  node hierarchy specification
    0x0002  stream-node association
    0x0003  node/scene update
    0x0004  compound object

Object Data (object type, 6 most significant bits)

    0b00.0010  text
    0b00.0011  MPEG2 VOP (rectangular VOP)
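By way of illustration, the extensible Object ID convention described above can be parsed as in the following C sketch. The struct and function names are hypothetical; only the bit layout and the 31/1023 extension rules come from the text.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        unsigned type;      /* object type, possibly extended            */
        unsigned number;    /* object number, possibly extended          */
        size_t   length;    /* total bytes consumed from the bitstream   */
    } object_id;

    /* 6 type bits, then 10 number bits; an all-ones field signals an
     * extension byte (type) or two extension bytes (number), with the
     * type extension preceding the number extension. */
    object_id parse_object_id(const uint8_t *buf)
    {
        object_id id;
        unsigned  raw   = ((unsigned)buf[0] << 8) | buf[1];
        unsigned  type6 = raw >> 10;        /* first six bits             */
        unsigned  num10 = raw & 0x3FF;      /* least significant ten bits */
        size_t    pos   = 2;

        id.type = (type6 == 0x3F) ? 31u + buf[pos++] : type6;
        if (num10 == 0x3FF) {               /* 0b11.1111.1111             */
            id.number = 1023u + (((unsigned)buf[pos] << 8) | buf[pos + 1]);
            pos += 2;
        } else {
            id.number = num10;
        }
        id.length = pos;
        return id;
    }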

Persistent Objects (PO) are objects that should be saved at the decoder for use at a later time. An expiration time stamp (ETS) gives the life of a PO in milliseconds. A PO is not available to the decoder after ETS runs out. When a PO is to be used at a later time in a scene, only the corresponding composition information needs to be sent to the AV terminal.

Decoding Time Stamp (DTS) indicates the time an object (access unit) should be decoded by the decoder.

Presentation Time Stamp (PTS) indicates the time an object (access unit) should be presented by the decoder.

Lifetime Time Stamp (LTS) gives the duration (in milliseconds) an object should be displayed in a scene. LTS is implicit in some cases, e.g. in a video sequence where a frame is displayed for 1/frame-rate or until the next frame is available, whichever is larger. An explicit LTS is used when displaying graphics and text. An AV object should be decoded only once for use during its lifetime.

Expiration Time Stamp (ETS) is specified to support the notion of object persistence. An object, after it is presented, is saved at the decoder (cache) until a time given by ETS. Such an object can be used multiple times before ETS runs out. A PO with an expired ETS is no longer available to the decoder.
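A minimal C sketch of this persistence rule, assuming a flat array of cache entries (the entry layout is an assumption, not specified by the bitstream): an object is returned only while the current time precedes its ETS.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        unsigned  object_id;
        uint64_t  ets_ms;    /* expiration time stamp, in milliseconds */
        void     *decoded;   /* decoded object data                    */
    } cache_entry;

    /* Return the cached object, or NULL once its ETS has run out:
     * after that the object is no longer available to the decoder. */
    void *cache_fetch(cache_entry *cache, size_t n,
                      unsigned object_id, uint64_t now_ms)
    {
        for (size_t i = 0; i < n; i++)
            if (cache[i].object_id == object_id)
                return (now_ms < cache[i].ets_ms) ? cache[i].decoded : NULL;
        return NULL;
    }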

Object Time Base (OTB) defines the notion of time of a given AV object encoder. Different objects may belong to different time bases. The AV terminal adapts these time bases to the local one, as specified in the MSDL VM.

Object Clock Reference (OCR) can be used if necessary to convey the speed of the OTB to the decoder. By this mechanism, OTBs can be recovered/adapted at the AV terminal.

Composition Parameters are used to compose a scene (place an object in a scene). These include displacement from the upper left corner of the presentation frame, rotation angles, zooming factors, etc.

Priority indicates the priority of an object for transmission, decoding, and display. MPEG-4 supports 32 levels of priority. Lower numbers indicate higher priorities.

Persistence Indicator (PI) indicates whether an object is persistent.

Continuation Indicator (CI) indicates the end of an object in the current packet (or continuation).

Object Grouping facilitates operations to be applied to a set of objects with a single operation. Such a feature can be used to minimize the amount of composition information sent, as well as to support hierarchical scene composition based on independent sub-scenes. The composer manipulates the component objects as a group. The structure of a compound composition packet (CCP) is shown in FIG. 2c.

Bitstream Structure includes object composition packets for describing the composition and controlling the presentation of those packets, and object data packets that contain the data for the objects. A scene is composed by a set of composition packets. The bitstream supports representation of scenes as a hierarchy by using compound composition objects (CCP), also known as node hierarchy. A CCP allows combining composition objects to create complex audio-visual scenes.

Object Data Packet

-   ObjectID—min (default) 10 bits
-   CI and PI could be combined:
    -   00—begin non-persistent
    -   01—begin persistent
    -   10—continuation
    -   11—end of object
-   Priority: 5 bits, present only if CI/PI is 0b00 or 0b01
-   ETS: 30 bits, present if CI/PI is 0b01
-   For prediction-based video coding, VOP_type is indicated by two bits (00 (I), 01 (P), 10 (B), 11 (PB)), facilitating editing.

Object_data_packet {
    ObjectID        16 bits + any extensions
    CIPI            2 bits
    if (CIPI <= 1) {
        Priority    5 bits
        if (object type is MPEG VOP)   /* any prediction-based compression */
            VOP_type    2 bits
    }
    if (CIPI == 1)
        ETS         28 bits
    ObjectData
}
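Read against this syntax, parsing an ODP header reduces to fixed-width bit reads. The following C sketch assumes the caller already knows whether the object type uses prediction-based coding; the bit reader is a minimal illustration, not the MSDL parser, and ObjectID extensions are left to the caller for brevity.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { const uint8_t *buf; size_t bitpos; } bitreader;

    /* Read n bits, most significant bit first. */
    static unsigned get_bits(bitreader *br, unsigned n)
    {
        unsigned v = 0;
        while (n--) {
            v = (v << 1) |
                ((br->buf[br->bitpos >> 3] >> (7 - (br->bitpos & 7))) & 1u);
            br->bitpos++;
        }
        return v;
    }

    typedef struct {
        unsigned object_id, cipi, priority, vop_type;
        uint32_t ets;
    } odp_header;

    odp_header parse_odp(bitreader *br, int is_prediction_coded)
    {
        odp_header h = {0};
        h.object_id = get_bits(br, 16);
        h.cipi      = get_bits(br, 2);   /* 00/01 begin, 10 cont., 11 end */
        if (h.cipi <= 1) {
            h.priority = get_bits(br, 5);
            if (is_prediction_coded)
                h.vop_type = get_bits(br, 2);   /* I, P, B, PB */
        }
        if (h.cipi == 1)                 /* begin persistent: ETS follows */
            h.ets = get_bits(br, 28);
        /* ObjectData follows in the bitstream */
        return h;
    }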

Object Composition Packet

Object_composition_packet {
    ObjectID                16 bits + any extensions
    OCR_Flag                1 bit
    Display_Timers_Flag     1 bit
    DTS                     30 bits
    if (OCR_Flag)
        OCR                 30 bits
    if (Display_Timers_Flag) {
        PTS                 30 bits
        LTS                 28 bits
    }
    Composition_parameters
}

Composition Parameters are defined in section 2 of the MSDL Verification Model, MPEG N1483, Systems Working Draft V2.0, the disclosure of which is incorporated herein by reference.

Composition_parameters {
    visibility              1 bit
    composition_order       5 bits
    number_of_motion_sets   2 bits
    x_delta_0               12 bits
    y_delta_0               12 bits
    for (i = 1; i <= number_of_motion_sets; i++) {
        x_delta_i           12 bits
        y_delta_i           12 bits
    }
}
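As an illustration of how these parameters drive composition, the following C sketch places a decoded object into the presentation frame using the displacement fields alone; rotation, zoom and the motion sets are omitted, and one byte per pixel is assumed for brevity.

    #include <stdint.h>

    typedef struct {
        int      visibility;         /* 1 bit                             */
        unsigned composition_order;  /* 5 bits; drawing order             */
        int      x_delta, y_delta;   /* displacement from the upper left
                                        corner of the presentation frame  */
    } comp_params;

    /* Place one decoded object into the presentation frame buffer. */
    void compose(uint8_t *frame, int frame_w, int frame_h,
                 const uint8_t *object, int obj_w, int obj_h,
                 const comp_params *p)
    {
        if (!p->visibility)
            return;                  /* invisible objects are skipped */
        for (int y = 0; y < obj_h; y++)
            for (int x = 0; x < obj_w; x++) {
                int fx = p->x_delta + x, fy = p->y_delta + y;
                if (fx >= 0 && fx < frame_w && fy >= 0 && fy < frame_h)
                    frame[fy * frame_w + fx] = object[y * obj_w + x];
            }
    }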

Compound Composition Packet

Compound_composition_packet {
    ObjectID                16 bits
    PTS                     30 bits
    LTS                     28 bits
    Composition_parameters
    ObjectCount             8 bits
    for (i = 0; i < ObjectCount; i++) {
        Object_composition_packet
    }
}
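A compound packet lets the composer manipulate its components as one group, so a single edit to the group parameters affects every member. A minimal C sketch, assuming (as an illustration only) that displacements simply accumulate down the hierarchy:

    #include <stddef.h>

    typedef struct {
        unsigned object_id;
        int      x_delta, y_delta;   /* per-object displacement */
    } member;

    /* Apply the compound packet's displacement to every component, so
     * that one edit to the group parameters moves the whole group. */
    void compose_compound(int group_x_delta, int group_y_delta,
                          member *members, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            members[i].x_delta += group_x_delta;
            members[i].y_delta += group_y_delta;
            /* each member is then composed individually, as for an OCP */
        }
    }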

Scene Configuration Packet (SCP) is used to change the reference scene width and height, to flush the buffer, and for other configuration functions. The object type for SCPs is 0b00.0000. This allows for 1024 different configuration packets. The object number 0b00.0000.0000 (object ID 0x0000) is defined for use with flushing the terminal buffers.

Composition Control for Buffer Management (Object ID 0x0000)

AV terminal buffers are flushed using the Flush_Cache and Scene_Update flags. When using a hierarchical scene structure, the current scene graph is flushed and the terminal loads the new scene from the bitstream. Use of the flags allows for saving the current scene structure instead of flushing it. These flags are used to update the reference scene width and height whenever a new scene begins. If the Flush_Cache_Flag is set, the cache is flushed, removing the objects (if any). If the Scene_Update_Flag is set, there are two possibilities: (i) the Flush_Cache_Flag is set, implying that the objects in the cache will no longer be used; (ii) the Flush_Cache_Flag is not set, the new scene being introduced (an editing action on the bitstream) splices the current scene, and the objects in the scene will be used after the end of the new scene. The ETS of the objects, if any, will be frozen for the duration of the new scene introduced. The beginning of the next scene is indicated by another scene configuration packet.

Scene_configuration_packet {
    ObjectID            16 bits (0x0000)
    Flush_Cache_Flag    1 bit
    Scene_Update_Flag   1 bit
    if (Scene_Update_Flag) {
        ref_scene_width     12 bits
        ref_scene_height    12 bits
    }
}
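The flag semantics described above reduce to a few branches. In the following C sketch the terminal hooks (cache_flush, cache_freeze_ets, scene_set_reference, scene_load_from_bitstream) are hypothetical prototypes; only the decision structure comes from the text.

    /* Hypothetical terminal hooks (prototypes only). */
    void cache_flush(void);             /* drop all cached objects       */
    void cache_freeze_ets(void);        /* suspend the ETS countdown     */
    void scene_set_reference(unsigned w, unsigned h);
    void scene_load_from_bitstream(void);

    /* Apply a scene configuration packet per the flag semantics above. */
    void handle_scp(int flush_cache_flag, int scene_update_flag,
                    unsigned ref_w, unsigned ref_h)
    {
        if (flush_cache_flag)
            cache_flush();              /* cached objects not reused     */
        if (scene_update_flag) {
            if (!flush_cache_flag)
                cache_freeze_ets();     /* spliced scene: objects kept,
                                           their ETS frozen for the
                                           inserted scene's duration     */
            scene_set_reference(ref_w, ref_h);
            scene_load_from_bitstream();
        }
    }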

Composition Control for Scene Description (Object ID 0x0001)

A hierarchy of nodes is defined, describing a scene. The scene configuration packets can also be used to define a scene hierarchy that allows for a description of scenes as a hierarchy of AV objects. Each node in such a graph is a grouping of nodes that groups the leaves and/or other nodes of the graph into a compound AV object. Each node (leaf) has a unique ID followed by its parameters, as shown in FIG. 3.

Composition Control for Stream-Node Mapping (Object ID 0x0002)

As illustrated by FIG. 4, table entries associate the elementary object streams in the logical channels with the nodes in a hierarchical scene. The stream IDs are unique, but not the node IDs. This implies that more than one stream can be associated with the same node.
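Because stream IDs are unique but node IDs are not, the association table is in effect a one-to-many map. A small C sketch of a lookup that collects every stream feeding a given node (the table layout is an assumption):

    #include <stddef.h>

    typedef struct { unsigned stream_id; unsigned node_id; } stream_node;

    /* Collect every elementary stream attached to one node; several
     * streams may map to the same node, so the result is a list. */
    size_t streams_for_node(const stream_node *table, size_t n,
                            unsigned node_id, unsigned *out, size_t max)
    {
        size_t k = 0;
        for (size_t i = 0; i < n && k < max; i++)
            if (table[i].node_id == node_id)
                out[k++] = table[i].stream_id;
        return k;
    }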

Composition Control for Scene Updates (Object ID 0x0003)

FIG. 5 illustrates updating of the nodes in the scene hierarchy, by modifying the specific parameters of the node. The graph itself can be updated by adding/deleting the nodes in the graph. The update type in the packet indicates the type of update to be performed on the graph.

Architectural Embodiment

The embodiment described below includes an object-based AV bitstream and a terminal architecture. The bitstream design specifies, in a binary format, how AV objects are represented and how they are to be composed. The AV terminal structure specifies how to decode and display the objects in the binary bitstream.

AV Terminal Architecture

Further to FIG. 1 and with specific reference to FIG. 6, the input to the de-multiplexer 1 is an object-based bitstream such as an MPEG-4 bitstream, consisting of AV objects and their composition information multiplexed into logical channels (LC). The composition of objects in a scene can be specified as a collection of objects with independent composition specification, or as a hierarchical scene graph. The composition and control information is included in LC0. The control information includes control commands for updating scene graphs, resetting decoder buffers, etc. Logical channels 1 and above contain object data. The system includes a controller (or “executive”) 2 which controls the operation of the AV terminal.

The terminal further includes input buffers 3, AV object decoders 4, buffers 4′ for decoded data, a composer 5, a display 6, and an object cache 7. The input bitstream may be read from a network connection or from a local storage device such as a DVD, CD-ROM or computer hard disk. LC0, containing the composition information, is fed to the controller. The DMUX stores the objects in LC1 and above at the location in the buffer specified by the controller. In the case of network delivery, the encoder and the stream server cooperate to ensure that the input object buffers neither overflow nor underflow. The encoded data objects are stored in the input data buffers until read by the decoders at their decoding time, typically given by an associated decoding timestamp. Before writing a data object to the buffer, the DMUX removes the timestamps and other headers from the object data packet and passes them to the controller for signaling of the appropriate decoders and input buffers. The decoders, when signaled by the controller, decode the data in the input buffers and store the results in the decoder output buffers. The AV terminal also handles external input such as user interaction.

In the object cache 7, objects are stored for use beyond their initial presentation time. Such objects remain in the cache even if the associated node is deleted from the scene graph, but are removed only upon the expiration of an associated time interval called the expiration time stamp. This feature can be used in presentations where an object is used repeatedly over a session. The composition associated with such objects can be updated with appropriate update messages. For example, the logo of the broadcasting station can be downloaded at the beginning of the presentation and the same copy can be used for repeated display throughout a session. Subsequent composition updates can change the position of the logo on the display. Objects that are reused beyond their first presentation time may be called persistent objects.
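In code, the logo scenario reduces to a decode-once path: on each reuse the terminal consults the object cache first and decodes only on a miss. The function names below are hypothetical stand-ins for the terminal's cache, decoder and composer, not part of the described system.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical hooks into the cache, a decoder and the composer. */
    void *cache_lookup(unsigned object_id, uint64_t now_ms); /* NULL on miss
                                                                or expiry  */
    void *decode_object(unsigned object_id);
    void  cache_store(unsigned object_id, void *decoded);
    void  compose_to_frame(void *decoded);   /* uses the latest
                                                composition update        */

    /* Present a persistent object: decode at most once per session. */
    void present_persistent(unsigned object_id, uint64_t now_ms)
    {
        void *obj = cache_lookup(object_id, now_ms);
        if (!obj) {
            obj = decode_object(object_id);   /* first presentation only */
            cache_store(object_id, obj);
        }
        compose_to_frame(obj);   /* position may differ after an update */
    }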

System Controller (SC)

The system controller controls decoding and playback of bitstreams on the AV terminal. At startup, from user interaction or by looking for a session at a default network address, the SC first initializes the DMUX to read from a local storage device or a network port. The control logic is loaded into the program RAM at the time of initialization. The instruction decoder reads the instructions from the program and executes them. Execution may involve reading the data from the input buffers (composition or external data), initializing the object timers, loading or updating the object tables in the data RAM, loading object timers, or control signaling.

FIG. 7 shows the system controller in further detail. The DMUX reads the input bitstream and feeds the composition data on LC0 to the controller. The composition data begins with the description of the first scene in the AV presentation. This scene can be described as a hierarchical collection of objects using compound composition packets, or as a collection of independent object composition packets. A table that associates the elementary streams with the nodes in the scene description immediately follows the scene description. The controller loads the object IDs (stream IDs) into the object list and render list, which are maintained in the data RAM. The render list contains the list of objects that are to be rendered on the display device. An object that is disenabled by user interaction is removed from the render list. A node delete command that is sent via a composition control packet causes the deletion of the corresponding object IDs from the object list. The node hierarchy is also maintained in the data RAM and updated whenever a composition update is received.

The composition decoder reads data from the composition and external data buffer and converts it into a format understood by the instruction decoder. The external input includes user interaction to select objects, disenable and enable objects, and certain predefined operations on the objects. During the execution of the program, two lists are formed in the data RAM: the object list, containing a list of objects (object IDs) currently handled by the decoders, and the render list, containing the list of active objects in the scene. These lists are updated dynamically as the composition information is received. For example, if a user chooses to hide an object by passing a command via the external input, the object is removed from the render list until specified by the user. This is also how external input is handled by the system. Whenever there is some external interaction, the composition decoder reads the external data buffer and performs the requested operation.

The SC also maintains timing for each AV object to signal the decoders and decoder buffers of decoding and presentation times. The timing information for an AV object is specified in terms of its time base. The terminal uses the system clock to convert an object's time base into system time. For objects that do not need decoding, only presentation timers are necessary. These timers are loaded with the decoding and presentation timestamps for that AV object. The controller obtains the timestamps from the DMUX for each object. When a decoding timer for an object runs out, the appropriate decoder is signaled to read data from the input buffers and to start the decoding process. When a presentation timer runs out, the decoded data for that object is transferred to the frame buffer for display. A dual buffer approach could be used to allow writing to a frame buffer while the contents of the second buffer are displayed on the monitor. The instruction decoder can also reset the DMUX or input buffers by signaling a reset, which initializes them to the default state.
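The timer mechanism can be pictured as a polling loop over per-object decoding and presentation deadlines, with the dual-buffer swap at presentation time. A sketch only; the signaling hooks and the zero-value sentinel are assumptions for illustration.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { unsigned object_id; uint64_t dts, pts; } timer_entry;

    /* Hypothetical signaling hooks. */
    void signal_decoder(unsigned object_id);
    void move_to_back_buffer(unsigned object_id);
    void swap_frame_buffers(void);

    /* One pass over the object timers; 0 marks a timer already fired. */
    void tick(timer_entry *t, size_t n, uint64_t now)
    {
        int presented = 0;
        for (size_t i = 0; i < n; i++) {
            if (t[i].dts && now >= t[i].dts) {  /* decoding timer ran out */
                signal_decoder(t[i].object_id);
                t[i].dts = 0;
            }
            if (t[i].pts && now >= t[i].pts) {  /* presentation timer     */
                move_to_back_buffer(t[i].object_id);
                t[i].pts = 0;
                presented = 1;
            }
        }
        if (presented)
            swap_frame_buffers();   /* dual buffers: write one while the
                                       other is displayed                */
    }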

Information Flow in the Controller

FIG. 8 shows the flow of information in the controller. To keep the figure simple, the operations performed by the instruction decoder are shown in groups. The three groups respectively concern object property modifications, object timing, and signaling.

Object Property Modifications

These operations manipulate the object IDs, also called elementary stream IDs. When a scene is initially loaded, a scene graph is formed with the object IDs of the objects in the scene. The controller also forms and maintains a list of the objects in the scene (the object list) and a list of the active objects in the scene (the render list). Other operations set and update object properties such as composition parameters when the terminal receives a composition packet.

Object Timing

This group of operations deals with managing object timers for synchronization, presentation, and decoding. An object's timestamp, specified in terms of its object time base, is converted into system time, and the presentation and decoding times of that object are set. These operations also set and reset expiration timestamps for persistent objects.
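A linear clock mapping suffices to illustrate the conversion: the object time base is anchored to the system clock at session start and scaled by the rate recovered via OCR. The timebase structure below is an assumption for illustration, not part of the bitstream.

    #include <stdint.h>

    typedef struct {
        uint64_t otb_origin;  /* OTB value at session start              */
        uint64_t sys_origin;  /* system clock at session start (ms)      */
        double   rate;        /* system ms per OTB tick, from OCR        */
    } timebase;

    /* Map an object time base timestamp onto the system clock. */
    uint64_t otb_to_system(const timebase *tb, uint64_t otb_ts)
    {
        return tb->sys_origin +
               (uint64_t)((double)(otb_ts - tb->otb_origin) * tb->rate);
    }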

Signaling

Signaling operations control the overall operation of the terminal. Various components of the terminal are set, reset and operated by controller signaling. The controller checks the decoding and presentation times of the objects in the render list and signals the decoders and presentation frame buffers accordingly. It also initializes the DMUX for reading from a network or a local storage device. At the instigation of the controller, decoders read the data from the input buffers and pass the decoded data to the decoder output buffers. The decoded data is moved to the presentation device when signaled by the controller.

We claim:
1. A method for displaying object-based audiovisual/video data, comprising: at a receiver, (a) controlling acquisition over time of streaming data in a data bit stream from a sender, the data bit stream including a plurality of audiovisual/video objects; (b) storing in a cache memory at least one of the objects and corresponding expiration time data for the at least one of the objects; and (c) processing composition information to compose scenes from said objects including the one of the objects stored in the cache memory.
2. The method of claim 1, with at least one of the objects being received from a network connection.
3. The method of claim 1, with at least one of the objects being received from local memory.
4. The method of claim 1, with at least one of the objects being received from local memory and at least one other of the objects being received from a network connection, and with the composed scenes comprising the one and the other of the objects.
5. The method of claim 1, further comprising responding to interactive user input.
6. The method of claim 5, wherein responding comprises at least one of selecting, enabling and disenabling one of the objects.
7. The method according to claim 1, further comprising transmitting the composed scene.
8. Apparatus for displaying object-based audiovisual/video data, comprising: (a) a controller circuit for controlling acquisition over time of streaming data in a data bit stream from a sender, the data bit stream including a plurality of audiovisual/video objects; (b) a cache memory for storing at least one of the objects and corresponding expiration time data for the at least one of the objects; and (c) a composer circuit, coupled to the cache memory, for processing composition information to compose scenes from said video objects including the one of the objects stored in the cache memory.
9. The apparatus of claim 8, further comprising a transmitter for transmitting the composed scene.
10. Apparatus for displaying object-based audiovisual/video data, comprising a processor which is instructed for: (a) controlling acquisition over time of streaming data including a plurality of audiovisual/video objects; (b) storing in a cache memory at least one of the objects and corresponding expiration time data for the at least one of the objects; and (c) processing composition information to compose scenes from said video objects including the one of the objects stored in the cache memory.

11. The apparatus of claim 10, further comprising a transmitter for transmitting the composed scene.
12. A method of displaying object-based audiovisual/video data comprising: at a receiver: controlling acquisition over time from local media of streaming data in a data bit stream including a plurality of audiovisual/video objects; storing in a cache memory at least one of said objects and corresponding expiration time data for the at least one of the objects; and processing composition information to compose scenes from at least one of said objects stored in said cache memory.
13. The method of claim 12, wherein responding comprises at least one of selecting, enabling and disenabling one of the objects.
14. The method according to claim 12, further comprising transmitting the composed scene.
15. An apparatus for displaying audiovisual/video data comprising: a controller circuit for controlling acquisition over time of streaming data from local media including a plurality of audiovisual/video objects; a cache memory for storing at least one of the objects and corresponding expiration time data for the at least one of the objects; and a composer circuit, coupled to the cache memory, for processing composition information to compose scenes from said objects including the one of the objects stored in the cache memory.
16. An apparatus according to claim 15, wherein said audiovisual/video objects are stored on said local media using MPEG-4 compression techniques.
17. The apparatus of claim 15, further comprising a transmitter for transmitting the composed scene.
18. A method for displaying object-based audiovisual/video data, comprising: streaming data in a data bit stream over a telecommunications channel to a receiver, the data bit stream including a plurality of audiovisual/video objects, at least one of the objects and corresponding expiration time data for the at least one of the objects stored in a cache memory at the receiver, composition information processed to compose scenes from said objects including the one of the objects stored in the cache memory at the receiver, and the composed scenes displayed at the receiver.
19. The method according to claim 18, wherein said telecommunications channel is a cable network.
20. The method according to claim 18, further comprising transmitting the composed scene.